In this competition we’re given around 30 million (simulated) check-ins on Facebook in a 10km by 10km grid. The goal is to build a model that predicts what business a user checks into based on spatial and temporal information. The tricky part here is that there are around 100k different classes(place_id) so most supervised learning techniques won’t work on the entire dataset. However most classes are clustered in only certain parts of the grid so the idea we will pursue here is to select a small-ish square within the grid and try to see if we can do better within the small square. First we will do some exploratory data analysis in the smaller square then we will use a random forest algorithm for prediction and finally, we will analyze the results.

Read and Clean Data:

Load the required packages and read in the data:

library(data.table)
library(dplyr)
library(ggplot2)
library(ranger)
library(plotly)
library(tidyr)
library(FNN)
library(xgboost)
library(corrplot)
library(gridExtra)

train = fread("train.csv", integer64 = "character", showProgress = FALSE)
summary(train)
##      row_id               x                y             accuracy      
##  Min.   :       0   Min.   : 0.000   Min.   : 0.000   Min.   :   1.00  
##  1st Qu.: 7279505   1st Qu.: 2.535   1st Qu.: 2.497   1st Qu.:  27.00  
##  Median :14559010   Median : 5.009   Median : 4.988   Median :  62.00  
##  Mean   :14559010   Mean   : 5.000   Mean   : 5.002   Mean   :  82.85  
##  3rd Qu.:21838515   3rd Qu.: 7.461   3rd Qu.: 7.510   3rd Qu.:  75.00  
##  Max.   :29118020   Max.   :10.000   Max.   :10.000   Max.   :1033.00  
##       time          place_id        
##  Min.   :     1   Length:29118021   
##  1st Qu.:203057   Class :character  
##  Median :433922   Mode  :character  
##  Mean   :417010                     
##  3rd Qu.:620491                     
##  Max.   :786239
#check missing values
which(is.na(train))
## integer(0)
sum(duplicated(train))
## [1] 0

We can see that there’s no missing value nd duplicates in the dataset

Now we’ll select a subset of the data, because the dataset is extremely large. I’ll pick a random 250 meters by 250 meters square in our imaginary Facebook city.

train %>% filter(x >1, x <1.25, y >2.5, y < 2.75) -> train
head(train, 3)
##    row_id      x      y accuracy   time   place_id
## 1:    600 1.2214 2.7023       17  65380 6683426742
## 2:    957 1.1832 2.6891       58 785470 6683426742
## 3:   4345 1.1935 2.6550       11 400082 6889790653

EDA(Exploratory DATA Analysis):

week = 60 * 24 * 7
train$week <- floor(train$time / week)
weekstats <- train %>% group_by(week) %>%
  summarize(
    acc_mean = mean(accuracy), checkins = length(row_id)
  )

  acc_byWeek <- ggplot(weekstats, aes(x = week, y = acc_mean)) + geom_line(size=2, col=1) + labs(y="Mean Accuracy")
  acc_dens <- ggplot(train, aes(x = accuracy)) + geom_density()
  grid.arrange(acc_byWeek, acc_dens, ncol=2)

We do not know these peaks in mean accuracy either in time or in the distribution, Some seem to be related. Thus, we will look at how the accuracy by week compares to the number of check in’s per week. We will normalize each variable to show it on the same axis:

weekstats <- weekstats %>% scale %>% as.data.frame %>% melt(id="week")
## Warning in melt(., id = "week"): The melt generic in data.table has been passed
## a data.frame and will attempt to redirect to the relevant reshape2 method;
## please note that reshape2 is deprecated, and this redirection is now deprecated
## as well. To continue using melt methods from reshape2 while both libraries are
## attached, e.g. melt.list, you can prepend the namespace like reshape2::melt(.).
## In the next version, this warning will become an error.
ggplot(weekstats, aes(x = week, y = value, color=variable)) + geom_line(size=2) + scale_color_discrete(breaks=c("checkins","acc_mean"), labels=c("Number of Checkins", "Mean Accuracy")) + ylab("normalized value")

Time is only given simply as a numeric value. For the purpose of clear observation and understanding, let’s extract a new feature called hour that gives the hour in the day (from 0 to 24). At the same time, extract (approximations) of other time units such as weekday, month and year.

train$hour = (train$time/60) %% 24
train$weekday = (train$time/(60*24)) %% 7
train$month = (train$time/(60*24*30)) %% 12
train$year = train$time/(60*24*365)
train$day = train$time/(60*24) %% 365

We will use holdup sample method to split our dataset into a training and validating set so we can check the results. We choose the validation set to be the more recent check-ins so that our validation structure is similar to the one on the test set.

small_train = train[train$time < 7.3e5,]
small_val = train[train$time >= 7.3e5,] 

Distribution of data grouped by place_id:

ggplot(small_train, aes(x, y )) +
    geom_point(aes(color = place_id)) + 
    theme_minimal() +
    theme(legend.position = "none") +
    ggtitle("Check-ins colored by place_id")

Now we can see the rough distribution of the data sorted by place_id. But some of the groups are overlapping and it is hard to identify certain characteristics. We will use the hour component as the third variable and to juse have a look at the most popular clusters to get a more reasonable observation.

small_train %>% count(place_id) %>% filter(n > 500) -> ids
small_trainz = small_train[small_train$place_id %in% ids$place_id,]
summary(small_trainz)
##      row_id               x               y            accuracy     
##  Min.   :     600   Min.   :1.000   Min.   :2.518   Min.   :  1.00  
##  1st Qu.: 7318538   1st Qu.:1.033   1st Qu.:2.597   1st Qu.: 20.00  
##  Median :14323301   Median :1.100   Median :2.669   Median : 63.00  
##  Mean   :14496397   Mean   :1.104   Mean   :2.653   Mean   : 76.43  
##  3rd Qu.:21592582   3rd Qu.:1.166   3rd Qu.:2.702   3rd Qu.: 74.00  
##  Max.   :29112154   Max.   :1.250   Max.   :2.750   Max.   :992.00  
##       time          place_id              week           hour       
##  Min.   :   476   Length:5683        Min.   : 0.0   Min.   : 0.000  
##  1st Qu.:130170   Class :character   1st Qu.:12.0   1st Qu.: 6.817  
##  Median :340766   Mode  :character   Median :33.0   Median :11.000  
##  Mean   :345766                      Mean   :33.8   Mean   :11.450  
##  3rd Qu.:558315                      3rd Qu.:55.0   3rd Qu.:15.900  
##  Max.   :729777                      Max.   :72.0   Max.   :23.983  
##     weekday             month                year                day         
##  Min.   :0.000694   Min.   : 0.000532   Min.   :0.0009056   Min.   :   1.38  
##  1st Qu.:1.932639   1st Qu.: 1.578287   1st Qu.:0.2476598   1st Qu.: 377.30  
##  Median :3.380556   Median : 3.243519   Median :0.6483371   Median : 987.73  
##  Mean   :3.488683   Mean   : 4.253711   Mean   :0.6578501   Mean   :1002.22  
##  3rd Qu.:5.240625   3rd Qu.: 6.669421   3rd Qu.:1.0622432   3rd Qu.:1618.30  
##  Max.   :6.999306   Max.   :11.999375   Max.   :1.3884646   Max.   :2115.30
plot_ly(data = small_trainz, x = ~x , y = ~y , z = ~hour, color = ~place_id,  type = "scatter3d", mode = "markers", marker=list(size= 5)) %>% layout(title = "Place_id's by position and Time of Day")

We can see that adding the time dimension definitely helps. The daily cycles are clearly visible above - for certain places the check-in’s stop for a few hours and then start peaking up again. Other businesses have quite a few peaks throughout the day, and the peaks tend to be rather different for different businesses.

Construct another one at the day of week:

plot_ly(data = small_trainz, x = ~x , y = ~y, z = ~weekday, color = ~place_id,  type = "scatter3d", mode = "markers", marker=list(size= 5)) %>% layout(title = "Place_id's by position and Day of Week")

There is some variation by day of week (perhaps some businesses are busier on the weekend) but the most visible trend is still the day cycles.

However we still might have too many classes for something like random forest to work at its best.

length(unique(small_train$place_id))
## [1] 770

Then, we will remove the place_id’s that have only three or less occurrences in the city are we picked. This will decrease the number of classes by a lot. Since we have a validation set we can always come back and change the filter level to see if we get better results.

small_train %>% count(place_id) %>% filter(n > 3) -> ids
small_train = small_train[small_train$place_id %in% ids$place_id,]

Now we have 15595 training examples and 224 classes and we’re ready to do some machine learning!

K Nearest Neighbors

KNN can be used for classification in a supervised setting where we are given a dataset with target labels. For classification, KNN finds the k nearest data points in the training set and the target label is computed as the mode of the target label of these k nearest neighbors.

WHY KNN? * Simple to implement and intuitive to understand * Can learn non-linear decision boundaries when used for classification and regression. Can came up with a highly flexible decision boundary adjusting the value of K. * No Training Time for classification/regression : The KNN algorithm has no explicit training step and all the work happens during prediction * Constantly evolves with new data: Since there is no explicit training step, as we keep adding new data to the dataset, the prediction is adjusted without having to retrain a new model. * Single Hyper parameters: There is a single hyper parameter, the value of K. This makes hyper parameter tuning easy. * Choice of distance metric: There are many distance metrics to chose from. Some popular distance metrics used are Euclidean, Manhattan, Minkowski, hamming distance eand so on.

s = 2
l = 125
w = 500

create_matrix = function(train) {
    cbind(s*train$y,
          train$x,
          train$hour/l,
          train$weekday/w,
          train$year/w,
          train$month/w,
          train$time/(w*60*24*7))
    }

X = create_matrix(small_train)
X_val = create_matrix(small_val)

Now we will build the knn model

model_knn = FNN::knn(train = X, test = X_val, cl = small_train$place_id, k = 15)

preds <- as.character(model_knn)
truth <- as.character(small_val$place_id)
mean(truth == preds)
## [1] 0.5151964

We can see that knn did not perform very well in this circumstance, let’s try to use Random Forest instead.

Random Forest:

set.seed(2022)
small_train$place_id <- as.factor(small_train$place_id) # ranger needs factors for classification
model_rf <- ranger(place_id ~ x + y + accuracy + hour + weekday + month + year,
                   small_train,
                   num.trees = 100,
                   write.forest = TRUE,
                   importance = "impurity")
## Growing trees.. Progress: 64%. Estimated remaining time: 17 seconds.
pred = predict(model_rf, small_val)
pred = pred$predictions
accuracy = mean(pred == small_val$place_id)

We get an accuracy of 0.5411416. The accuracy improves compared with knn, but it is not that good. Because the evaluation metric for this competition is mean average precision at 3 so predicting votes/probabilities by class and then counting the top three id’s is guaranteed to improve our score. But for simplicity we’ll just stick to accuracy.

Then we will take a look at the predictions on the validation set:

small_val$Correct = (pred == small_val$place_id)

ggplot(small_val, aes(x, y )) +
    geom_point(aes(color = Correct)) + 
    theme_minimal() +
    scale_color_brewer(palette = "Set1")

It does seem that the correctly identified check-ins are more “clustered” while the wrongly identified ones are more uniformly distributed but other than that no clear patters here. Let’s also take a look at what kind of id’s the random forest model gets wrong. To do this we will look at accuracy by id and also plot the id’s based on how often they appear in the validation set. We see below that our model is doing actually really great on the more popular id’s(more blue on the right). However it loses when it looks at id’s that appear only a few times.

#reordering the levels based on counts:
small_val$place_id <- factor(small_val$place_id,
                             levels = names(sort(table(small_val$place_id), decreasing = TRUE)))

small_val %>% 
    ggplot(aes(x = place_id)) + geom_bar(aes(fill = Correct)) + 
    theme_minimal() +
    theme(axis.text.x = element_blank()) +
    ggtitle("Prediction Accuracy by ID and Popularity") +
    scale_fill_brewer(palette = "Set1")

We see above that our model is doing actually really good on the more popular id’s. However it loses when it looks at id’s that appear only a few times.

Let’s look at the importance of our variables as well:

data.frame(as.list(model_rf$variable.importance)) %>% gather() %>% 
    ggplot(aes(x = reorder(key, value), y = value)) +
    geom_bar(stat = "identity", width = 0.6, fill = "grey") +
    coord_flip() +
    theme_minimal() +
    ggtitle("Variable Importance (Gini Index)") +
    theme(axis.title.y = element_blank()) 

All the y variable is more important than the x coordinate. This means that the y axis is a better predictor of place_id and the random forest figures this out on its own. hour and other time features are also good predictors but less so than the spatial features, this makes sense since the location of a check-in should be more important than the time of the check-in. And lastly , it is interesting that we can see accuracy is not that important here.